A database of calculated solution parameters for the AlphaFold predicted protein structures

Recent spectacular advances by AI programs in 3D structure predictions from protein sequences have revolutionized the field in terms of accuracy and speed. The resulting “folding frenzy” has already produced predicted protein structure databases for the entire human and other organisms’ proteomes. However, rapidly ascertaining a predicted structure’s reliability based on measured properties in solution should be considered. Shape-sensitive hydrodynamic parameters such as the diffusion and sedimentation coefficients (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${D_{t(20,w)}^{0}}$$\end{document}Dt(20,w)0, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${s_{{\left( {{20},w} \right)}}^{{0}} }$$\end{document}s20,w0) and the intrinsic viscosity ([η]) can provide a rapid assessment of the overall structure likeliness, and SAXS would yield the structure-related pair-wise distance distribution function p(r) vs. r. Using the extensively validated UltraScan SOlution MOdeler (US-SOMO) suite, a database was implemented calculating from AlphaFold structures the corresponding \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${D_{t(20,w)}^{0}}$$\end{document}Dt(20,w)0, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${s_{{\left( {{20},w} \right)}}^{{0}} }$$\end{document}s20,w0, [η], p(r) vs. r, and other parameters. Circular dichroism spectra were computed using the SESCA program. Some of AlphaFold’s drawbacks were mitigated, such as generating whenever possible a protein’s mature form. Others, like the AlphaFold direct applicability to single-chain structures only, the absence of prosthetic groups, or flexibility issues, are discussed. Overall, this implementation of the US-SOMO-AF database should already aid in rapidly evaluating the consistency in solution of a relevant portion of AlphaFold predicted protein structures.


US-SOMO-AF database and data organization background
We chose a NoSQL MongoDB (https://www.Mongodb.com) database to store the hydrodynamic calculations and metadata due to its familiarity to the authors and its native support by the chosen website framework. Structural and CD properties were kept as separate files to simplify website direct download access. All steps processing large numbers of structures were performed by creating and running command line scripts. All work was done on the PDB files and not the mmCIF files, as we knew the hydrodynamic, structural and CD calculation programs we utilized supported the PDB format.
Collecting and pre-processing the AF database entries The first step was to download (http://ftp.ebi.ac.uk/pub/databases/alphafold) the entire AlphaFold (AF) database. We initially processed the AF-v1 data with a collection of scripts. We subsequently built a pipeline (https://github.com/ehb54/somoaf-pipeline) to process the data, which will be outlined here. The pipeline is written in Perl (https://www.perl.org) and consists of six stages. Note the pipeline can be run in parallel by providing unique lists accession codes to each job. Initially a configuration script is edited (config.json) from the template (config.json.template) with appropriate paths set. Each stage takes as input a list of UniProt accession codes.

Pipeline processing stage 1 -collect data
The first pipeline stage, collect.pl, collects the FASTA sequence and UniProt Features from UniProt (the latter from the UniProt REST API, https://www.uniprot.org/help/programmatic_access) and writes these into output files used by the next pipeline stage.

Pipeline processing stage 2 -build sequence summary information
The second stage, chains.pl, examines the AF PDB, FASTA and UniProt Features and builds a chains file for each accession code for consumption by the subsequent stage. The UniProt Features of category MOLECULE_PROCESSING have types of "init_met", "signal", "propep", "peptide" and "chain". This information is manipulated to produce the chains file. From these Features, the stage makes multiple handling decisions. Unknown start and end residue sequences (represented as mismatch between FASTA and AF PDB residue count). At this stage, accession codes with multiple AF frames are dropped, as these are not useful for further computations.

Pipeline processing stage 3 -initial PDB and PDB variant creation
The third stage, pdbs.pl, reads the accession's chains file and cuts any initial methionine, transit and signal peptides. This becomes one output PDB. For any propeptides present, all permutations of removal are computed and variant output PDBs are created, each noted with a -pp# extension. For example, if the first and third propeptide are removed, the extension will be -pp1_3. Note that when removing propeptides causes a sequential residue number break, unique chain ids are assigned, (e.g. B, C), as US-SOMO's treats broken chains (residues missing) differently than intact chains, leading to differences in the bead models generated (normally, the main chain beads are always positioned at each peptide bond center of mass, but if breaks are present, the main chain beads are positioned at each CA). Note also that when post-translational modifications would result only in a set of small peptides, no additional PDB structures were generated (e.g. Q19165).

Pipeline processing stage 4 -finalize PDBs and mmcifs
The fourth stage, pdb2.pl, takes each PDB produced by the previous stage and finalizes it by adding additional records. For SSBONDs we utilized the disulphide bond identification feature recently developed in US-SOMO 25 , and modified the US-SOMO code to expose this feature via the Brookes and Rocco US-SOMO-AF database -Supplementary Information 7 command line. SEQRES records are updated to reflect the actual residues and chains present in the PDB. To add α-helix and β-sheet information into the PDB files, UCSF-Chimera 30 is used, which includes a DSSP-based algorithm 29 to identify stretches of secondary structure. Chimera adds the HELIX, SHEET and CONECT records to the PDB file. Chimera also uses the updated SEQRES record to add OXTs for cases where chain breaks were introduced or the C-terminal truncated due to the removal of a propeptide. REMARK records are added to the PDB detailing this stage's processing steps and the residues remaining. Next, the stage produces mmCIFs for the website using RCSB's MAXIT software (https://sw-tools.rcsb.org/apps/MAXIT). This required two processing runs as MAXIT needed to first convert the PDB file to a CIF file and subsequently convert the CIF file to an mmCIF file. Finally, an additional PDB is created with the temperature factors replaced by AF's confidence levels mapped to values appropriate for JSmol visualization.

Pipeline processing stage 5 -compute
The fifth stage, compute.pl, computes the hydrodynamic parameters, structural data (SAXS p(r) vs. r) and circular dichroism spectra, and assembles the resulting data. To compute the hydrodynamic parameters and structural properties, the US-SOMO software is called, which includes several modeling options and multiple hydrodynamic parameter calculation algorithms 23-25 . For this work, we chose the "SoMo with overlaps" bead modeling method, which coupled with the ZENO computational method 31-33 has produced the best matching between experimental and computed parameters for an extended set of test proteins 21,24 . The SoMo with overlaps method is based on the original SOMO method in which each main and sidechain segments for every amino acid are represented by a bead whose volume includes that of the theoretically bound hydration waters 53 . Although US-SOMO now includes the state-of-the-art GRPY method 25,54 to compute the hydrodynamic parameters of bead models with overlaps, it is significantly more computationally intensive, and its main advantage would be to produce also the rotational diffusion parameters (such as the rotational correlation time(s)), which are, however, more difficult to accurately measure. All the computations were carried out under standard solvent conditions (water at 20 °C, pH 7).

Brookes and Rocco
US-SOMO-AF database -Supplementary Information 8 US-SOMO is also used to compute the SAXS p(r) vs. r curves 47 . To allow execution by the pipeline, the development US-SOMO code was enhanced to support the execution of script files.
To compute the CD spectra from our prepared PDB files, the program SESCA 20 is used. SESCA's SESCA_main.py program is called on our PDB files producing a CD_comp.out file containing the CD spectrum.

Pipeline processing stage 6 -package
The final stage, package.pl, places all results in defined target directories to be easily uploaded to the website. All MongoDB commands are concatenated into a single file, which is used to update the database on the website.
All the above stages are run in sequence on a provided set of accession codes, making it relatively straightforward, however computationally intensive, to update the database whenever AF provides updated structure predictions. Simultaneous multiple runs of the pipeline are supported, each running on unique sets of accession codes.

Processing performance notes
The most time-consuming step in the processing pipeline is the hydrodynamic calculations which took approximately one minute per PDB on either an AMD EPYC 7742 (University of Lethbridge, shared) or an AMD EPYC 7764 (Texas Advanced Computer Center -TACC, Lonestar 6, exclusive node access). AF-v1 computations were completed exclusively on University of Lethbridge's EPYC in shared mode. However, to gain additional throughput, we enabled the pipeline on TACC's resources. A dedicated TACC dual EPYC node gave us throughput of 100 PDBs per minute, requiring approximately 165 node hours to compute the entire AF-v2 database.

Brookes and Rocco
US-SOMO-AF database -Supplementary Information 9

Website implementation
The website was created with the GenApp framework 55

DMD simulations of AF-O88338
To expand the conformational space of AF-O88338, we used US-SOMO's interface 58 56,57 . Relaxation was run for 5 ps at 0.7 kcal/mol/k B , the production was run for 5 ns at 0.6 kcal/mol/k B . The Andersen thermostat was used for both the relaxation and run stages.

Monte Carlo Molecular Mechanics
To expand the conformational space of Q4DE01, A0A060D4L2 and Q8IJG3, we utilized the Monomer Monte Carlo module of SASSIE-web 45 (https://sassie-web.chem.utk.edu/sassie2/). The flexible regions were provided as residues 1-72 and 746-957, 1-118, 1-40, respectively. The number of trial attempts were set to 20,000. For all other input fields, defaults were used. As a fraction of the structures are rejected due to steric clashes, the final counts of accepted (produced) structures were 16,520, 16,766, 16,367, respectively. Finally, we counted in each M bin the number of R s (or [η]) pairs having a % difference ≥ 6 or ≥ 9

Data analysis
(twice or three-times the estimated experimental average error, respectively). For the evaluation of the maximum Z-score 60 , the computed [η] values were placed in 2% increments confidence level bins (from 100% to 20%). For each bin, mean and SD were computed, from which the maximum Z-scores were then derived. All the computations involving binning and subsequent steps were done with custom Perl scripts.